Titanic Survival Prediction Project

  1. Objective:

    The primary goal of this project is to predict the survival of passengers onboard the Titanic using machine learning techniques.

    By analyzing and interpreting the available dataset, we aim to build a reliable model that can accurately predict whether a passenger survived or did not survive the disaster based on various features, such as age, gender, class, and more.

    The insights gained from this project can help us understand the factors that contributed to a passenger's survival and provide a historical perspective on the tragedy.

Let's take a look at the variables in this dataset.

Variable | Definition | Key | Type
Survived | Survival | 0 = No, 1 = Yes | int
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | int
sex | Sex | | object
age | Age in years | Fractional if less than 1; estimated ages are of the form xx.5 | float64
sibsp | # of siblings / spouses aboard the Titanic | Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored) | int
parch | # of parents / children aboard the Titanic | Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so parch = 0 for them | int
ticket | Ticket number | | object
fare | Passenger fare | | float
cabin | Cabin number | | object
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | object

Now, let's start the data exploration and analysis.

In [54]:
### Import Libraries ###
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from matplotlib.cm import rainbow
import seaborn as sns
import plotly.io as pio
pio.renderers.default = 'notebook'

print('Done')
Done
In [55]:
train_data = pd.read_csv('train.csv') # Read 'train.csv' and 'test.csv' into pandas DataFrames
test_data = pd.read_csv('test.csv')
In [56]:
train_data.head(5)
Out[56]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [57]:
train_data.tail(5)
Out[57]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [58]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [59]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB
In [60]:
train_data.isnull().sum()
Out[60]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [61]:
test_data.isnull().sum()
Out[61]:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
In [62]:
# NaN values in 'Embarked' replaced by the mode
mode_Embarked = train_data['Embarked'].mode().iloc[0]
train_data['Embarked'] = train_data['Embarked'].fillna(mode_Embarked)

Given that the 'Cabin' variable has 687 missing values out of 891 rows, which corresponds to approximately 77% of the data, it is difficult to impute the missing values accurately. In this case, it is reasonable to consider dropping the 'Cabin' feature from the dataset.

However, before making that decision, it is essential to assess whether the 'Cabin' variable holds any valuable information that might contribute to predicting passenger survival. For example, it is possible that certain cabin areas had easier access to lifeboats, which could be a significant factor in survival.

In [63]:
# Extract the first letter of the cabin to represent the cabin section
train_data['CabinSection'] = train_data['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else x)

# Calculate the survival rate for each cabin section
cabin_survival = train_data.groupby('CabinSection')['Survived'].mean().sort_values(ascending=False)

# Plot the survival rate for each cabin section
plt.figure(figsize=(10, 5))
sns.barplot(x=cabin_survival.index, y=cabin_survival.values)
plt.xlabel('Cabin Section')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Cabin Section')
plt.show()

The distribution of survival rates across cabin sections indicates that there may be some relationship between the cabin location and the survival rate of passengers.

It seems that passengers in sections D, E, and B had higher chances of survival, while those in sections A and T had lower chances.

This information could potentially be useful for our predictive model.

Given this observed relationship, it is worth considering keeping the 'CabinSection' feature.

In [64]:
train_data['CabinSection'] = train_data['CabinSection'].fillna('Unknown')
train_data.head(5)
Out[64]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked CabinSection
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Unknown
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Unknown
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S C
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S Unknown
In [65]:
# Group the data by 'Pclass' and 'Sex' and calculate the median age for each group
age_medians = train_data.groupby(['Pclass', 'Sex'])['Age'].median()

# Define a function to impute age based on 'Pclass' and 'Sex'
def impute_age(row):
    if pd.isnull(row['Age']):
        return age_medians[row['Pclass'], row['Sex']]
    else:
        return row['Age']

# Apply the function to the dataset
train_data['Age'] = train_data.apply(impute_age, axis=1)
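As a side note, the same group-wise imputation can be written without a row-wise apply, using a vectorized groupby transform. A minimal sketch on a toy frame (the values are illustrative only):

```python
import pandas as pd

# Toy frame standing in for train_data (illustrative values only)
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Sex': ['male', 'male', 'female', 'female', 'female'],
    'Age': [40.0, None, 21.0, 25.0, None],
})

# Fill each missing Age with the median of its (Pclass, Sex) group
df['Age'] = df['Age'].fillna(
    df.groupby(['Pclass', 'Sex'])['Age'].transform('median')
)
print(df['Age'].tolist())  # [40.0, 40.0, 21.0, 25.0, 23.0]
```

This produces the same result as the `impute_age` function above, and avoids a Python-level loop over rows.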

Let's do the same thing for the test set¶

In [66]:
# Group the data by 'Pclass' and 'Sex' and calculate the median age for each group
test_age_medians = test_data.groupby(['Pclass', 'Sex'])['Age'].median()

# Define a function to impute age based on 'Pclass' and 'Sex'
def impute_age(row):
    if pd.isnull(row['Age']):
        return test_age_medians[row['Pclass'], row['Sex']]
    else:
        return row['Age']

# Apply the function to the dataset
test_data['Age'] = test_data.apply(impute_age, axis=1)
In [67]:
# Extract the first letter of the cabin to represent the cabin section
test_data['CabinSection'] = test_data['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else x)

test_data['CabinSection'] = test_data['CabinSection'].fillna('Unknown')
In [68]:
train_data.hist(figsize=(10, 8))
Out[68]:
array([[<Axes: title={'center': 'PassengerId'}>,
        <Axes: title={'center': 'Survived'}>,
        <Axes: title={'center': 'Pclass'}>],
       [<Axes: title={'center': 'Age'}>,
        <Axes: title={'center': 'SibSp'}>,
        <Axes: title={'center': 'Parch'}>],
       [<Axes: title={'center': 'Fare'}>, <Axes: >, <Axes: >]],
      dtype=object)
In [69]:
# Embarked, CabinSection, Pclass, Sex vs. Survived
Key_df = train_data[['Sex','Embarked','Pclass','CabinSection','Survived']]
In [70]:
Key_df.loc[:, 'Survived'] = Key_df['Survived'].map({1: "Yes", 0: "No"})
Key_df.loc[:, 'Pclass'] = Key_df['Pclass'].map({1: 'Upper', 2: 'Middle', 3: 'Lower'})
Key_df.loc[:, 'Embarked'] = Key_df['Embarked'].map({'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'})
In [71]:
Key_df.head()
Out[71]:
Sex Embarked Pclass CabinSection Survived
0 male Southampton Lower Unknown No
1 female Cherbourg Upper C Yes
2 female Southampton Lower Unknown Yes
3 female Southampton Upper C Yes
4 male Southampton Lower Unknown No
In [72]:
#Survival rate group by embarked
Key_df.groupby('Embarked')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
Out[72]:
Survived No Yes
Embarked
Cherbourg 0.446429 0.553571
Queenstown 0.610390 0.389610
Southampton 0.660991 0.339009
In [73]:
#Survival rate group by Pclass
Key_df.groupby('Pclass')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
Out[73]:
Survived No Yes
Pclass
Upper 0.370370 0.629630
Middle 0.527174 0.472826
Lower 0.757637 0.242363
In [74]:
#Survival rate group by Sex
Key_df.groupby('Sex')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
Out[74]:
Survived No Yes
Sex
female 0.257962 0.742038
male 0.811092 0.188908
In [75]:
#Survival rate group by CabinSection
Key_df.groupby('CabinSection')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
Out[75]:
Survived No Yes
CabinSection
D 0.242424 0.757576
E 0.250000 0.750000
B 0.255319 0.744681
F 0.384615 0.615385
C 0.406780 0.593220
G 0.500000 0.500000
A 0.533333 0.466667
Unknown 0.700146 0.299854
T 1.000000 NaN
In [76]:
#CabinSection "unknown" count
Key_df['CabinSection'].value_counts()
Out[76]:
CabinSection
Unknown    687
C           59
B           47
D           33
E           32
A           15
F           13
G            4
T            1
Name: count, dtype: int64
In [77]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style for the plots
sns.set(style='whitegrid')

# Create a barplot for the number of survivors by 'Embarked'
plt.figure(figsize=(8, 5))
sns.countplot(x='Embarked', hue='Survived', data=Key_df)
plt.xlabel('Embarked')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Embarked')
plt.show()

# Create a barplot for the number of survivors by 'Pclass'
plt.figure(figsize=(8, 5))
sns.countplot(x='Pclass', hue='Survived', data=Key_df)
plt.xlabel('Pclass')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Pclass')
plt.show()

# Create a barplot for the number of survivors by 'Sex'
plt.figure(figsize=(8, 5))
sns.countplot(x='Sex', hue='Survived', data=Key_df)
plt.xlabel('Sex')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Sex')
plt.show()

# Create a barplot for the number of survivors by 'CabinSection'
plt.figure(figsize=(10, 5))
sns.countplot(x='CabinSection', hue='Survived', data=Key_df)
plt.xlabel('Cabin Section')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Cabin Section')
plt.show()

Embarked:¶

A majority of the passengers embarked from Southampton, but their survival rate was lower than at the other ports.

Passengers who embarked from Cherbourg had a higher survival rate compared to the other ports, with the number of survivors exceeding the number of non-survivors.

Queenstown falls in between, with a survival rate of roughly 39%, closer to Southampton's than to Cherbourg's.

Pclass:¶

Passengers in the Lower class (Pclass = 'Lower') had the lowest survival rate, with a significantly higher number of non-survivors compared to survivors.

Passengers in the Upper class (Pclass = 'Upper') had the highest survival rate, with more survivors than non-survivors.

The Middle class had a more balanced number of survivors and non-survivors, indicating a less pronounced impact on survival rate compared to the other classes.

Sex:¶

Female passengers had a much higher survival rate than male passengers. The number of female survivors is more than twice the number of non-survivors.

Male passengers had a significantly lower survival rate, with the number of non-survivors being more than four times the number of survivors.

In [78]:
Age_fig = make_subplots(rows=1,cols=1,specs=[[{"type":"histogram"}]])

Age_fig.add_trace(
    go.Histogram(
        x = train_data['Age'].where(train_data['Survived']==1),
        name = 'Yes',
        marker={'color':'white'},
        nbinsx=10
    ),
    row=1,col=1
)

Age_fig.add_trace(
    go.Histogram(
        x = train_data['Age'].where(train_data['Survived']==0),
        name = 'No',
        marker={'color':'red'},
        nbinsx=10
    ),
    row=1,col=1
)

Age_fig.update_layout(title_text="Age survive distribution",template='plotly_dark', width=500)

Age_fig.show()
In [79]:
# 0-9 years old, 10-19 years old, 20-29 years old, 30-39 years old, 40-49 years old, 50-59 years old, 60-69 years old, 70-79 years old, 80-89 years old. The pivot table of the survival rate of each age group.
AgeGroup = pd.cut(train_data['Age'], bins=[0, 9, 19, 29, 39, 49, 59, 69, 79, 89])
train_data.pivot_table(index=AgeGroup, values='Survived', aggfunc='mean').sort_values(by='Survived', ascending=False)
Out[79]:
Survived
Age
(79, 89] 1.000000
(0, 9] 0.612903
(29, 39] 0.454054
(49, 59] 0.416667
(9, 19] 0.401961
(39, 49] 0.354545
(59, 69] 0.315789
(19, 29] 0.315642
(69, 79] 0.000000

Most passengers were 20-29 years old.

Children (0-9) had a relatively high chance of survival, while passengers in their teens and twenties had among the lowest survival rates. (The perfect rate in the (79, 89] bracket reflects very few passengers in that group.)

Now that we have a good understanding of this dataset, let's move on to feature engineering and feature selection.¶

In [80]:
# Create a new dataframe that drops 'Cabin'
Clean_train_data = train_data.drop('Cabin', axis=1).copy()
In [81]:
#Combine SibSp and Parch to get a 'FamilySize' feature
Clean_train_data['FamilySize'] = Clean_train_data['SibSp'] + Clean_train_data['Parch'] + 1

# Create an 'IsAlone' feature based on 'FamilySize':
Clean_train_data['IsAlone'] = Clean_train_data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
In [82]:
# Show the relationship between # of siblings / spouses aboard the Titanic and survive rate

Family_fig = make_subplots(rows=1,cols=1,specs=[[{"type":"histogram"}]])

Family_fig.add_trace(
    go.Histogram(
        x = Clean_train_data['FamilySize'].where(Clean_train_data['Survived']==1),
        name = 'Yes',
        marker={'color':'white'},
        nbinsx=20
    ),
    row=1,col=1
)

Family_fig.add_trace(
    go.Histogram(
        x = Clean_train_data['FamilySize'].where(Clean_train_data['Survived']==0),
        name = 'No',
        marker={'color':'red'},
        nbinsx=20
    ),
    row=1,col=1
)

# Fig Width = 500, Height = 400
Family_fig.update_layout(title_text="Family survive distribution",template='plotly_dark', width=500, height=400)

Family_fig.show()
In [83]:
# Count of each family size
Clean_train_data['FamilySize'].value_counts()
Out[83]:
FamilySize
1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: count, dtype: int64
In [84]:
# Survival rate of each family size
Clean_train_data.pivot_table(index='FamilySize', values='Survived', aggfunc='mean').sort_values(by='Survived', ascending=False)
Out[84]:
Survived
FamilySize
4 0.724138
3 0.578431
2 0.552795
7 0.333333
1 0.303538
5 0.200000
6 0.136364
8 0.000000
11 0.000000

Passengers traveling alone (family size 1) had a relatively low survival rate.

Families of 2-4 people had the highest survival rates.

Families with more than 4 people had lower survival rates.
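Given this pattern, one could optionally bin FamilySize into a coarser categorical feature. A sketch of a hypothetical 'FamilyGroup' column (not used in the rest of this notebook; the bin edges follow the alone / 2-4 / 5+ pattern observed above):

```python
import pandas as pd

# Illustrative family sizes covering the three regimes seen in the data
sizes = pd.Series([1, 2, 4, 5, 11])

# Bins: (0, 1] = Alone, (1, 4] = Small, (4, 20] = Large
family_group = pd.cut(sizes, bins=[0, 1, 4, 20],
                      labels=['Alone', 'Small', 'Large'])
print(family_group.tolist())  # ['Alone', 'Small', 'Small', 'Large', 'Large']
```

Such a grouping can sometimes help tree-based models by reducing the cardinality of a feature whose effect is non-monotonic.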

In [85]:
TransData = Clean_train_data.copy()
# convert categorical variables into numerical representations
TransData['Sex'] = TransData['Sex'].map({'male': 0, 'female': 1})
# One-hot encoding for 'Embarked' and 'CabinSection'
TransData = pd.get_dummies(TransData, columns=['Embarked', 'CabinSection'])
TransData.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
In [86]:
df_corr = TransData.corr()
plt.figure(figsize=(20, 15))
sns.heatmap(round(df_corr, 2), annot=True, cmap="mako")
Out[86]:
<Axes: >

Apply the same transformations to the test set¶

In [87]:
Clean_test_data = test_data.drop('Cabin', axis=1).copy()
#Combine SibSp and Parch to get a 'FamilySize' feature
Clean_test_data['FamilySize'] = Clean_test_data['SibSp'] + Clean_test_data['Parch'] + 1

# Create an 'IsAlone' feature based on 'FamilySize':
Clean_test_data['IsAlone'] = Clean_test_data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)

# convert categorical variables into numerical representations
Clean_test_data['Sex'] = Clean_test_data['Sex'].map({'male': 0, 'female': 1})
# One-hot encoding for 'Embarked' and 'CabinSection'
Clean_test_data = pd.get_dummies(Clean_test_data, columns=['Embarked', 'CabinSection'])
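One caveat with calling get_dummies separately on the two sets: the training data contains a 'CabinSection_T' column (a single passenger) that the test data lacks, so the frames end up with different columns. A minimal sketch (on toy frames) of one way to align them, using `reindex`:

```python
import pandas as pd

# Toy frames: the train set has a 'T' cabin section that the test set lacks
train_dummies = pd.get_dummies(pd.DataFrame({'CabinSection': ['A', 'B', 'T']}))
test_dummies = pd.get_dummies(pd.DataFrame({'CabinSection': ['A', 'B']}))

# Add any training-only dummy columns to the test frame, filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
# ['CabinSection_A', 'CabinSection_B', 'CabinSection_T']
```

In this notebook the mismatch is instead handled later by simply excluding 'CabinSection_T' from the selected features, but reindexing is the more general safeguard.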
In [88]:
Clean_test_data['Fare'] = Clean_test_data['Fare'].fillna(Clean_test_data['Fare'].mean())
In [89]:
Clean_test_data.isnull().sum()
Out[89]:
PassengerId             0
Pclass                  0
Name                    0
Sex                     0
Age                     0
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
FamilySize              0
IsAlone                 0
Embarked_C              0
Embarked_Q              0
Embarked_S              0
CabinSection_A          0
CabinSection_B          0
CabinSection_C          0
CabinSection_D          0
CabinSection_E          0
CabinSection_F          0
CabinSection_G          0
CabinSection_Unknown    0
dtype: int64

Next, we will identify the features that are potentially significant for predicting passenger survival, using recursive feature elimination (RFE) with logistic regression¶

In [90]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Preprocess the dataset (drop the 'Survived' target variable and non-numeric columns)
X = TransData.drop(['Survived'], axis=1)
y = TransData['Survived']

# Standardize the numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a logistic regression model
logreg = LogisticRegression(solver='liblinear')

# Recursive feature elimination with cross-validated selection
rfecv = RFECV(estimator=logreg, step=1, cv=StratifiedKFold(5), scoring='accuracy')
rfecv.fit(X_scaled, y)

print("Optimal number of features: %d" % rfecv.n_features_)

# Plot the number of features versus cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation score (accuracy)")
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1), rfecv.cv_results_['mean_test_score'])
plt.show()


# Get the selected features
selected_features = X.columns[rfecv.support_]
print("Selected features:", selected_features)
Optimal number of features: 15
Selected features: Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'FamilySize', 'IsAlone',
       'Embarked_S', 'CabinSection_C', 'CabinSection_D', 'CabinSection_E',
       'CabinSection_F', 'CabinSection_G', 'CabinSection_T',
       'CabinSection_Unknown'],
      dtype='object')
In [91]:
# Note: 'CabinSection_T' (selected by RFECV) is excluded because it does not occur in the test set
Select_Data = TransData[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'FamilySize', 'IsAlone',
       'Embarked_S', 'CabinSection_C', 'CabinSection_D', 'CabinSection_E',
       'CabinSection_F', 'CabinSection_G',
       'CabinSection_Unknown']].copy()
In [92]:
Test_Select = Clean_test_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'FamilySize', 'IsAlone',
       'Embarked_S', 'CabinSection_C', 'CabinSection_D', 'CabinSection_E',
       'CabinSection_F', 'CabinSection_G',
       'CabinSection_Unknown']].copy()

These are the features we will use for modeling.¶

In [93]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
In [94]:
#Split the data into training and validation sets
from sklearn.model_selection import train_test_split

X = Select_Data
y = TransData['Survived']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
In [95]:
# Preprocess the data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
In [96]:
models = [
    ('Logistic Regression', LogisticRegression(solver='liblinear')),
    ('KNN', KNeighborsClassifier()),
    ('Support Vector Machines', SVC()),
    ('Naive Bayes', GaussianNB()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier(n_estimators=100)),
    ('Perceptron', Perceptron()),
    ('Artificial Neural Network', MLPClassifier(max_iter=2000))
]
In [97]:
results = []
names = []

for name, model in models:
    # Note: X_scaled here is the full standardized feature matrix from the RFECV cell,
    # so this comparison uses all features rather than the Select_Data subset,
    # and the train/validation split above is not used in this step
    cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
    results.append(cv_scores)
    names.append(name)
    print(f"{name}: {cv_scores.mean():.3f} ({cv_scores.std():.3f})")

plt.figure(figsize=(10, 5))
plt.boxplot(results)
plt.xticks(range(1, len(names) + 1), names, rotation=45)
plt.ylabel('Accuracy')
plt.title('Model Comparison Using Cross-Validation')
plt.show()
Logistic Regression: 0.799 (0.014)
KNN: 0.785 (0.024)
Support Vector Machines: 0.800 (0.019)
Naive Bayes: 0.468 (0.115)
Decision Tree: 0.787 (0.045)
Random Forest: 0.801 (0.028)
Perceptron: 0.749 (0.047)
Artificial Neural Network: 0.802 (0.044)

The ANN model has the highest mean accuracy (0.802) among the compared models, but also a relatively high standard deviation (0.044) compared to some of the others.

The Support Vector Machines (SVM) model also looks like a good choice, thanks to its high accuracy and lower standard deviation.

We will choose the ANN model here.¶
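Before fitting the final model, the ANN's hyperparameters could optionally be tuned. A minimal sketch with GridSearchCV on synthetic stand-in data (the grid values are illustrative only, not the settings used in the cell below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# Illustrative grid; a real search would cover more values
param_grid = {
    'hidden_layer_sizes': [(50,), (100,)],
    'alpha': [1e-4, 1e-3],
}
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=42),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The best parameters found by the search would then be passed to the MLPClassifier constructor when fitting on the real training data.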

In [98]:
# Preprocess the training dataset (drop the 'Survived' target variable and non-numeric columns)
X_train = Select_Data
y_train = TransData['Survived']

# Standardize the numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Create and fit the ANN model
ann = MLPClassifier(max_iter=1000)
ann.fit(X_train_scaled, y_train)
Out[98]:
MLPClassifier(max_iter=1000)
In [99]:
# Preprocess the test dataset
X_test = Test_Select

# Standardize the numeric features using the same scaler fitted on the training data
X_test_scaled = scaler.transform(X_test)
In [100]:
# Predict the survival probabilities for the test dataset
y_test_proba = ann.predict_proba(X_test_scaled)

# Predict the survival outcomes for the test dataset
y_test_pred = ann.predict(X_test_scaled)
In [101]:
output = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': y_test_pred})
output.to_csv('predictions.csv', index=False)
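As a final sanity check, the submission frame should contain exactly the two expected columns, unique passenger IDs, and only 0/1 labels. A quick sketch on a toy frame of the same shape (the IDs and labels are illustrative):

```python
import pandas as pd

# Toy submission standing in for `output` above (illustrative values)
submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                           'Survived': [0, 1, 0]})

# The expected submission format: two columns, unique IDs, binary labels
assert list(submission.columns) == ['PassengerId', 'Survived']
assert submission['PassengerId'].is_unique
assert submission['Survived'].isin([0, 1]).all()
print('Submission format OK')
```

Running the same checks against the real `output` frame before writing predictions.csv can catch column-order or dtype mistakes early.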